library(readr)
bc = read_delim("r1.withq.barcodes", delim=" ")
## 
|===========                                                                    |  14%   36 MB
|============                                                                   |  15%   39 MB
|=============                                                                  |  16%   42 MB
|==============                                                                 |  18%   45 MB
|===============                                                                |  19%   48 MB
|================                                                               |  20%   52 MB
|=================                                                              |  21%   55 MB
|==================                                                             |  23%   58 MB
|===================                                                            |  24%   61 MB
|====================                                                           |  25%   64 MB
|=====================                                                          |  26%   67 MB
|======================                                                         |  27%   70 MB
|=======================                                                        |  29%   73 MB
|========================                                                       |  30%   76 MB
|=========================                                                      |  31%   79 MB
|==========================                                                     |  32%   82 MB
|===========================                                                    |  34%   85 MB
|============================                                                   |  35%   88 MB
|=============================                                                  |  36%   91 MB
|==============================                                                 |  37%   94 MB
|===============================                                                |  38%   97 MB
|================================                                               |  40%  101 MB
|=================================                                              |  41%  104 MB
|==================================                                             |  42%  107 MB
|==================================                                             |  43%  110 MB
|===================================                                            |  44%  113 MB
|====================================                                           |  46%  116 MB
|=====================================                                          |  47%  119 MB
|======================================                                         |  48%  122 MB
|=======================================                                        |  49%  125 MB
|========================================                                       |  51%  128 MB
|=========================================                                      |  52%  131 MB
|==========================================                                     |  53%  134 MB
|===========================================                                    |  54%  137 MB
|============================================                                   |  55%  140 MB
|=============================================                                  |  57%  143 MB
|==============================================                                 |  58%  146 MB
|===============================================                                |  59%  150 MB
|================================================                               |  60%  153 MB
|=================================================                              |  61%  156 MB
|==================================================                             |  63%  159 MB
|===================================================                            |  64%  162 MB
|====================================================                           |  65%  165 MB
|=====================================================                          |  66%  168 MB
|======================================================                         |  68%  171 MB
|=======================================================                        |  69%  174 MB
|========================================================                       |  70%  177 MB
|=========================================================                      |  71%  180 MB
|==========================================================                     |  72%  183 MB
|===========================================================                    |  74%  186 MB
|============================================================                   |  75%  189 MB
|=============================================================                  |  76%  192 MB
|==============================================================                 |  77%  195 MB
|===============================================================                |  78%  199 MB
|================================================================               |  80%  202 MB
|=================================================================              |  81%  205 MB
|==================================================================             |  82%  208 MB
|===================================================================            |  83%  211 MB
|====================================================================           |  85%  214 MB
|====================================================================           |  86%  217 MB
|=====================================================================          |  87%  220 MB
|======================================================================         |  88%  223 MB
|=======================================================================        |  89%  226 MB
|========================================================================       |  91%  229 MB
|=========================================================================      |  92%  232 MB
|==========================================================================     |  93%  235 MB
|===========================================================================    |  94%  238 MB
|============================================================================   |  95%  241 MB
|=============================================================================  |  97%  244 MB
|============================================================================== |  98%  247 MB
|===============================================================================|  99%  251 MB
|================================================================================| 100%  252 MB
library(dplyr)
grouped = bc %>% group_by(barcode, index) %>% summarise(count=n(), bcq=mean(barcodeq),
                                                        iq=mean(indexq))

Summary stats

There are 65778 unique barcode-index pairs and of those 5838 are seen more than 10 times.

We can see below that there is are a small set of barcode-index pairs that are represented a high number of times.

library(ggplot2)
ggplot(grouped, aes(count)) + geom_histogram() + theme_bw() +
  scale_x_log10() + scale_y_sqrt()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Known barcodes

Now we’ll mark up the barcode-index pair with whether or not these are some of the known sequences. The indexes we found correspond to the reverse complement column in the index file and the barcodes correspond to the sense column in the barcode file.

kbc = read_csv("metadata/BC sequences.csv")
kin = read_csv("metadata/indexseqs.csv")
grouped$kin = grouped$index %in% kin$rc
grouped$kbc = grouped$barcode %in% kbc$sense
grouped$known = grouped$kin & grouped$kbc
grouped$known = ifelse(grouped$known, "known", "unknown")
expected = read_csv("metadata/expected-pairs.csv")
colnames(expected) = c("ebarcode", "eindex")
expected$pair = paste(expected$ebarcode, expected$eindex, sep=":")
grouped$pair = paste(grouped$barcode, grouped$index, sep=":")
grouped$expected = grouped$pair %in% expected$pair
known = subset(grouped, known == "known")

I wrote out the barcode-index pairs with counts, the mean of the barcode quality (bcq), the mean of the index quality (iq), whether or not the index is a known index (kin), whether or not the barcode is a known barcode (kin), and whether or not the barcode-index pair is expected (known).

This table is here All barcode index

write.table(grouped, file="all-barcode-index-unfiltered.csv", col.names=TRUE, row.names=FALSE, quote=FALSE, sep=",")

Now we can see if we mark up the barcode-index pairs with whether or not they are from the known barcode and known index files, we find that there are only 940 barcode-index pairs.

If we plot the histograms separating out whether they are a known or unknown barcode-index pair, we can see that the known barcode-index pairs have many more reads associated with them than the unknown barcode-index pairs.

library(ggplot2)
ggplot(grouped, aes(count)) + geom_histogram() + theme_bw() +
  facet_wrap(~known) +
  scale_x_log10() + scale_y_sqrt() +
  ylab("number of barcode-index pairs")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that the PHRED quality of the known barcode-index pairs has a tight distribution:

ggplot(grouped, aes(bcq, iq)) + geom_point() + theme_bw() +
  facet_wrap(~known) +
  ylab("mean PHRED quality of index") +
  xlab("mean PHRED quality of barcode")

Here you can see that if we filter for only known barcodes and indexes, they tend to have both a high index and barcode quality compared to unknown sequences.

ggplot(grouped, aes(bcq, iq, color=known)) + geom_point(size=0.5) + theme_bw() +
  facet_wrap(~kin+kbc) +
  ylab("mean PHRED quality of index") +
  xlab("mean PHRED quality of barcode")

If we look at the expected barcode-index pairings, we can see that there is no difference in the distribution of quality of the barcode-index pairs for the pairs we expect (TRUE) and the pairs we do not expect (FALSE). Filtering on quality will not help these results.

ggplot(subset(grouped, known == "known"), aes(bcq, iq)) +
  geom_point(size=0.5) + theme_bw() +
  facet_wrap(~expected) +
  ylab("mean PHRED quality of index") +
  xlab("mean PHRED quality of barcode")

We can see that the barcode-index pairs have about an order of magnitude less counts when they are not expected than when they are expected.

ggplot(subset(grouped, known == "known"), aes(count)) +
  geom_histogram() + theme_bw() +
  facet_wrap(~expected) +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Barcodes per index

We’re expecting 2 to 3 barcodes per index, but we generally see more than that if we’re seeing 940 barcodes. How many per barcode?

barcode_per_index = known %>% group_by(index) %>% summarise(nbarcodes = n())
index_per_barcode = known %>% group_by(barcode) %>% summarise(nindex = n())
ggplot(barcode_per_index, aes(index, nbarcodes)) +
  geom_bar(stat='identity') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("")

ggplot(index_per_barcode, aes(barcode, nindex)) + geom_bar(stat='identity') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("")

Overrepresented barcodes

We see every index with every barcode at least once. Are some overrepresented?

ggplot(known, aes(barcode, count)) + facet_wrap(~index) + geom_bar(stat='identity') +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

For some indexes it looks like yes, and for others no.

Overepresented barcodes colored by quality

ggplot(known, aes(barcode, count, fill=iq)) + facet_wrap(~index) +
  geom_bar(stat='identity') +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank()) +
  guides(fill=guide_legend(title="Index quality"))

ggplot(known, aes(barcode, count, fill=bcq)) + facet_wrap(~index) +
  geom_bar(stat='identity') +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank()) +
  guides(fill=guide_legend(title="Barcode quality"))

ggplot(known, aes(iq, bcq)) + facet_wrap(~index) +
  geom_point() + xlab("index quality") +
  ylab("barcode quality") + theme_bw()

Filtering for barcode pairs

Here I filtered by keeping only the barcodes for each index which were at least twice as prevalent as the average of the barcodes for that index. Then I filtered out all indexes that had more than 3 barcodes that passed that filter.

filtered = known %>% group_by(index) %>%
  mutate(mcount = mean(count)) %>%
  filter(count > 2*mcount) %>% group_by(index) %>%
  mutate(nfiltered = n()) %>%
  filter(nfiltered < 4)

That leaves us with these:

ggplot(filtered, aes(barcode, count)) + facet_wrap(~index) +
  geom_bar(stat='identity') + scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

Expected indexes

expected = read_csv("metadata/expected-pairs.csv")
colnames(expected) = c("ebarcode", "eindex")
expected$pair = paste(expected$ebarcode, expected$eindex, sep=":")
filtered$pair = paste(filtered$barcode, filtered$index, sep=":")
filtered$expected = filtered$pair %in% expected$pair
ggplot(filtered, aes(barcode, count, fill=expected)) + facet_wrap(~index) +
  geom_bar(stat='identity') + scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

output = filtered %>% select(barcode, index, count)
write.table(output, file="barcode-index.csv", col.names=TRUE,
            sep=",", quote=FALSE, row.names=FALSE)

Expected on unfiltered list

known$pair = paste(known$barcode, known$index, sep=":")
known$expected = known$pair %in% expected$pair
ggplot(known, aes(barcode, count, fill=expected)) +
  facet_wrap(~index) + geom_bar(stat='identity') +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

33 barcode mixes

index_mixes = c("ATAACGGT", "TCAGCATT", "TGTGACTA")
ggplot(subset(known, index %in% index_mixes),
       aes(barcode, count, fill=expected)) +
  facet_wrap(~index) + geom_bar(stat='identity') +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

Coloring by barcode and index quality:

ggplot(subset(known, index %in% index_mixes),
       aes(barcode, count, color=iq)) +
  facet_wrap(~index+expected) + geom_point() +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

ggplot(subset(known, index %in% index_mixes),
       aes(barcode, count, color=bcq)) +
  facet_wrap(~index+expected) + geom_point() +
  scale_y_sqrt() +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())

Filtering out by barcode quality might help with some but doesn’t weed out all of the false positives.

Output tables

output = filtered %>% select(barcode, index, count, expected)
write.table(output, file="filtered-barcode-index.csv", col.names=TRUE,
            sep=",", quote=FALSE, row.names=FALSE)
output = known %>% select(barcode, index, count, expected)
write.table(output, file="unfiltered-barcode-index.csv", col.names=TRUE,
            sep=",", quote=FALSE, row.names=FALSE)

filtered-barcode-index.csv

unfiltered-barcode-index.csv

I included an example of a CSV file of the expected pairs that is much easier to work with:

expected-pairs.csv

Raw barcode-index counts

This file has this format: the mean of the barcode quality (bcq), the mean of the index quality (iq), whether or not the index is a known index (kin), whether or not the barcode is a known barcode (kbc), whether or not both the barcode and index pair were known (known) and whether or nor the index-barcode pair was expected (expected). Along with the sequence of the barcode (barcode), the sequence of the index (index) and the number of reads that had that barcode-index pair (count).

All barcode index